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Abstract. Translating potential disease biomarkers between multi-species 
'omics' experiments is a new direction in biomedical research. The ex- 
isting methods are limited to simple experimental setups such as ba- 
sic healthy-diseased comparisons. Most of these methods also require 
an a priori matching of the variables (e.g., genes or metabolites) be- 
tween the species. However, many experiments have a complicated multi- 
way experimental design often involving irregularly-sampled time-series 
measurements, and for instance metabolites do not always have known 
matchings between organisms. We introduce a Bayesian modelling frame- 
work for translating between multiple species the results from 'omics' ex- 
periments having a complex multi-way, time-series experimental design. 
The underlying assumption is that the unknown matching can be in- 
ferred from the response of the variables to multiple covariates including 
time. 

Keywords: Cross-species translation, Data integration. Hidden Markov Model, 
Multi-way experimental design, Time-series, Translational medicine 

1 Introduction 

Cross-species analysis of biological data is an increasingly important direction 
in biological research. The analysis calls for multivariate methods, since 'omics' 
technologies, such as transcriptomics and metabolomics, enable studying the dy- 
namic response of biological organisms in various conditions, including various 
time points during disease progression. An important reserach problem of trans- 
lational medicine is translating potential biomarkers for disease between species. 

* I.H., T.S and S.K belong to the Adaptive Informatics Research Centre. The work was 
funded by Tekes MASI program and by Tekes Multibio project. I.H. is funded by the 
Graduate School of Computer Science and Engineering. S.K is partially supported 
by EU FP7 NoE PASCAL2, ICT 216886. 
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This would ultimately allow mapping phenotypes between model organisms and 
actual human experiments. 

The basic experimental design in searching for disease biomarkers is the one- 
way comparison of healthy and diseased populations. At the simplest, biomarkers 
can be translated between species by comparing lists of p-values of simple dif- 
ferential expression. Most existing cross-species analysis methods are limited to 
such simple designs |H]. A further limitation of most existing methods is that 
they require an a priori matching of variables (genes) between the organisms. 
Such orthology information is not always available, especially in metabolomics 
where the mapping of metabolites between organisms has barely started, and is 
an interesting research problem in itself. 

Most biological experiments have a multi-way experimental design, where 
healthy and diseased populations are further divided into subpopulations accord- 
ing to additional covariates such as gender, treatment groups, age, measurement 
times etc. A usual approach for dealing with the additional covariates is strati- 
fying the diseased-healthy comparison; a typical example is comparing healthy 
and diseased males and females separately. The standard statistical methods 
for properly dealing with multi-way designs, are Analysis of Variance (ANOVA) 
and its multivariate generalization (MANOVA) . While studying the effects of all 
the covariates on the data makes the analysis slightly more complicated, more 
information can be gained from each species to be used in the translation. Un- 
fortunately, there exist no earlier proper tools for utilizing the information of 
the effects of multiple covariates on the data in cross-species analysis. 

Furthermore, time-series experiments are becoming more and more common 
in clinical studies searching for disease biomarkers. Whereas in some cases the 
measurement times of such experiments are regular and allow a "neat and easy" 
data analysis, this is often not the case. In clinical follow-up studies, such as [13], 
measurement times are often irregular due to practical reasons of data collection, 
and there are missing time points. Also, in follow-up studies spanning timescales 
of years, individuals have been shown to develop into metabolic developmental 
states at an individual pace 112]. In addition, life spans of different organisms, 
such as man and mouse, are very different resulting in very different measure- 
ment times. These complications call for a possibility to align the time-series. 
All of these factors combined cause remarkable challenges for cross-species data 
analysis. 

Instead of searching for single molecule biomarkers, which have a high risk 
of false positives, we concentrate on finding combinations of similarly-behaving 
biomarkers, which is a way towards treating a transcriptomic or a metabolic 
profile as a fingerprint of the clinical status of the organism. For this, multivari- 
ate statistics is needed. In this paper, we will now show how the data analysis 
problem can be formulated as a new multivariate ANOVA-type model in the 
case where data comes from multiple sources (species) and one of the variables, 
namely the time, has a previously unknown structure (alignment). 

In this paper, we will present a formal framework for cross-species analysis 
of 'omics' data in the case of a multi-way, time-series experimental design. This 



Translating biomarkers between multi-way time-series experiments 



3 



methodology can be directly used for finding previously unknown matcliing of 
groups of variables between the species based on the data. In contrast to many- 
step approaches, the whole modelling is done in a single, unified, multivariate 
Bayesian model. The framework has estimation of uncertainty and dimensional- 
ity reduction built-in to overcome the main challenge: high dimensionality and 
small sample-size. The central underlying assumption is that the actual link be- 
tween the variables of the different species can be inferred from a similar response 
of the variables to multi-way covariates. 

Previous work in cross-species analysis 

Until now, meta-analysis of microarray data has been the major approach to 
cross-species studies in biology j8j. Plenty of meta-analyses have been done by 
either comparing lists of differentially expressed genes between several species, or 
by comparing expression levels of known gene orthologs between the species. So 
far, mainly highly-controlled cell cycle studies have been analyzed across several 
species and no attention has been paid to multi-way designs. 

One step towards translational analysis has been taken by Lucas et al. |10lllj . 
To help finding biomarkers from an in vivo experiment, they incorporated prior 
knowledge from results of an in vitro study analyzed with the same method. 
Like ours, their approach is based on generative Bayesian modeling of factors 
and can handle multiple covariates. Their approach does not, however, consider 
time-series cases with unaligned time nor the case without any a priori matching 
between the variables. 

A probabilistic model based on Gaussian random fields has been applied 
to two-species expression data [9^. This work combined differential expression 
scores from different species, cell types, and pathogens utilizing homology infor- 
mation. In [6 the task was to query large databases of micro-array experiments 
to identify similar experiments in different species, by utilizing partially known 
orthology information. In [7J time-series micro-array data from multiple species 
was used to discover causal relations between genes to discover conserved regu- 
latory networks. Also this approach naturally needs a priori known matching of 
orthologous genes. 

A standard method for finding similarities between several data sets is canon- 
ical correlation analysis (CCA)|2]. CCA assumes paired samples over the data 
set and thus is not directly applicable for the translation problem, where the 
samples (patients) are different over species. A simple iterative method for pair- 
ing genes has been developed in |14ll5j . In the case the genes are samples in the 
data matrix, optimal pairing of genes is sought by maximizing the dependency 
between the data sets estimated by CCA. A very similar method was recently 
used for regulatory network inference [1] . No prior matching of variables or sam- 
ples is assumed, and the method attempts to find both iteratively by alternating 
between matching of variables and matching of samples using a closely related 
method Co-Inertia analysis. These methods do not, however, take into account 
covariate information (including measurement time) of the samples, nor different 
time resolution of the covariates. 
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In summary: none of the existing approaches can take into account a muhi- 
way time-series covariate structure or exploit it to find previously unknown 
matchings between the variables without any a priori known matching infor- 
mation. 

2 Model 

Our method addresses the problem of translating biomarkers between multiple 
species from multi-way time-series experiments with previously unknown match- 
ings between the variables (metabolites). We do not assume the samples have a 
pairing but our main assumption is that there is a similar multi-way time-series 
experimental design in both experiments. 

A simple data analysis procedure towards this goal would be doing the uni- 
variate multi-way ANOVA analysis separately on the two data sets of the two 
species, and comparing the lists of p- values afterwards as a meta-analysis step. 

In the two-way case, to explain the covariate-related variation in one species, 
say X, the following linear model is usually assumed: 

Xj|(a,h) =M" + aa+/36 +(a/3)a& + ej. (1) 

Here is a continuous- valued data vector, observation number j, the /x^ is 
the overall (grand) mean, the a and b (a = 0, ... A and b = 0, . . . S) are the 
two independent covariates, such as disease and treatment. The a% and (3^ are 
parameter vectors describing the covariate-specific effects, called main effects. 
The {a./3)^ij is a parameter vector describing the interaction effect. 

Instead of searching for single-molecule biomarkers, that have a high risk of 
false positives, our approach is multivariate, concentrating on finding combina- 
tions of biomarkers. 

In order to tackle high dimensionality and scarcity of observations, we as- 
sume that there are groups of similarly behaving variables (metabolites) in each 
species. We then search for correlated groups of metabolites sharing a similar 
response to external covariates. These correlated groups (clusters) are therefore 
assumed to be shared between the species. Underlying this process is the assump- 
tion that similarity of multi-way behavior of groups of metabolites indicates a 
cross-species mapping of the metabolites. 

We have taken an ambitious goal by building a unified Bayesian model that 
integrates the separate multi-way experiments from multiple species. The model 
can be learned jointly by Gibbs sampling. 

From the point of view of ANOVA-type modelling the question is how to do 
multi-way modelling when the data comes from different sources with different 
variables {e.g., man and mouse having different metabolites). The solution is to 
consider data "source" as an additional covariate in the multi-way analysis |5]. 
From the data integration point of view the task is to find dependencies between 
the data sets when neither the variables nor the samples have been paired. 
There is, however, a shared multi-way covariate structure in the data sets, and 
it is utilized to find the mapping of groups of variables. We study additionally 
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the case where one of the covariates, time, has a previously unknown structure 
due to unknown time aUgnments. It will be shown that the alignments can be 
found simultaneously within the whole unified model. 

The model we develop for this task is an extension of our recently pub- 
lished multi-way modelling methods |4|3l5j . In we presented a method for 
multi-way ANOVA-type modelling in "small sample-size n, high dimensionality 
p "-conditions in the case of standard covariates, such as disease, treatment, 
and gender. The solution is to use regularized factor analysis for dimension- 
ality reduction, such that each variable is assumed to come from one factor 
only. The effects of multi-way covariates a.a, Pi,, {af3)ab are then estimated in 
the low-dimensional latent factor space. Each latent factor represents a group of 
correlated variables. The model is 

X, ^Miti + Vxf\A) . (2) 

Here Xj is a p-dimensional data vector, V is the projection matrix, and x^"* 
is the low-dimensional latent variable, A is a diagonal residual variance matrix 
with diagonal elements af. 

In [2j , we further extended this framework into integrating data sources with 
paired samples, such as having measurements from multiple tissues of each in- 
dividual. In [S], we first extended pi into time-series cases with unknown align- 
ments, such that these alignments can be learned simultaneously with the multi- 
way modelling task. In [5] , we also presented the basic principle and a simplified 
model of how the multi-way modelling framework can be extended into trans- 
lational modelling. This case is much more difficult than 3], because samples 
have not been paired between the data sources; For example, the pairing of one 
time-point of an individual test mouse and a time-point from one of the human 
patients cannot be assumed. In [5] we concentrated on finding a shared response 
of the variables to one covariate only; the aligned time. In this paper we now 
proceed by presenting the full translational model where, in addition to aligned 
time, there are other covariates, such as disease. Also, in this paper we separate 
the time- and disease behavior into shared and species-specific effects. 

2.1 Modelling time-series measurements from multiple populations 
with regular measurement times 

Let us now consider modelling data from time-series measurements from diseased 
and healthy populations in one species. If the measurement times are fixed and 
individuals can be assumed to have similar aging development, the data analysis 
can be seen as a two-way design and modelled with a linear model. When mod- 
elling the effects of time and other covariates on low-dimensional latent factors, 
each factor representing one correlated group of variables [4], we can use the 
model 

Xj-''*|(t,&) = at + /3(, + iaf3)tb + noise. (3) 
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We denote from now on the time-point by t, the disease status by & = {0, 1}, the 
effect of time by a.t, the effect of disease by /3j, and the interaction of time and 
disease {a.(3)tb- The last one is the most interesting, denoting the time-dependent 
disease effects. 

2.2 Modelling time-series measurements from multiple populations 
with irregular measurement times 

This work is motivated by the fact that in many real-world sparsely collected 
time-series datasets, especially from large-cohort human clinical studies, mea- 
surement times can be irregular within and between individuals; one particular 
state-of-the-art clinical lipidomic study is [13], on which we now concentrate. 
This study followed a set of patients; some of them remained healthy, some 
developed into type 1 Diabetes. Furthermore, it was shown in [H] that individ- 
uals progress into different age-related metabolic states at their indivual pace. 
This phenomenon can be modelled by assuming that there are underlying la- 
tent metabolomic development states and individuals progress into these states 
in their individual pace [12]. The underlying states were modelled by Hidden 
Markov Models (HMM), where the observed metabolic profiles are assumed to 
be emitted by the underlying states. This modelling assumption also deals with 
the problem of aligning irregular measurements. 

The important problem now is how to separate the effects of disease from the 
individual aging changes. In [12] the HMM model was trained separately for the 
healthy population and the diseased population, and such an approach cannot 
fully answer this question. 

The model [5] that can separate these two effects is 

*UtateO-,t)=s,b - ^f{as +I3b + ial3)sb, I), (4) 

where s is the latent development state (HMM-state), a^, is the effect of aligned 
HMM-time and {aj3)sb is the most interesting effect, the interaction of "HMM- 
time" and disease. We showed in [5] that it is possible to simultaneously estimate 
the terms in the model Q and learn the alignments of the time-series into 
the HMM development states. We assume a linear HMM-chain, allowing only 
self-transitions and transitions into the next state. The probability of the t:th 
time-point of individual j being in state s is 

t) ^s)^p {s{j, t)\s{j, t-l))p {xf\as +I3b + {af3)sb) P {s{j, t + t)) . 

(5) 

If more covariates are present in the study, it is straightforward to extend the 
model Q by additional terms. 

2.3 Translating biomarkers between species from time-series 
measurements from multiple populations with irregular 
measurement times 

We now propose that translation of results between multiple species, from multi- 
way time-series experiments, should be done by finding groups of similarly behav- 
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ing variables (metabolites) in both species that respond similarly into multi-way 
covariates. A data matrix representation of the data analysis problem and plate 
diagram of the Bayesian model are shown in Figure [T] We introduce a modeling 
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Fig. 1. a) Data matrix representation of the data analysis problem, (b) Plate diagram 
of tiie Bayesian graphical model. The set 0s = {cts, af , a^, (q:/3)s6, (Q:/3)Ji,, (a/J)^^} 
contains all latent variables describing the corresponding HMM state. The state of 
each sample is determined by an observed covariate h and an unobserved covariate s. 



framework that can do this task, even in the complicated case of having irregular 
time-series measurements that require alignment into hidden metabolic states. 

The model makes the very flexible assumption [5 that the observed data 
vectors in the two species with different variables, x and y, are generated by 
latent effects according to 

X = + r (a, + + (a/3).fc) + T (a? -t- (a/3)f,) + e, 

y = /x^ + P{oc, + /3, + (a/3),b) + P{a}>^ + /3,« + (a/3)'4) + e , (6) 

where a^, /3j, (ct/J)^;, are the shared effects of HMM-time, disease and inter- 
action of HMM-time and disease, respectively, , and (q:/?)^^, and are the 
species -specific effects of HMM-time, disease and interaction of HMM-time and 
disease, respectively, and likewise for species y. The variable-spaces of x and y 
are different, and therefore also the dimensions of the latent variables of x^"* and 
yjat representing groups of correlated variables in both species, need not match. 
For this reason, the latent effects of the covariates have to be projected into the 
actual observed data spaces x and y through previously unknown projections 
and that will be learned jointly with the model. 

The translational problem now becomes: Does some dimension of x^°* re- 
spond to the covariates s and h similarly as one of y'°*. ff it does, one can 
represent this behavior with shared effects (as,/3j, (Q:/3)sf,). The interpretation 
is that a cluster of correlated variables in x represented by the dimension of 
xj°* matches with a cluster of correlated variables in y. Such dimensions can 
be considered as multi-species biomarkers. If there is no match, the response to 
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the external covariates is modelled by species-specific effects (af, /3^, (a/3)^j)). 
With this framework, we are able to estimate confidence of the shared effects. 

2.4 Matching problem 

We propose the following measure for quantifying the quality of the match be- 
tween two clusters from different datasets: whether the matching is better than 
an average matching (over other pairs). On a meta- level the measure is intu- 
itively appealing in the spirit of permutation tests, and it can be formulated 
more exactly by specifying what we mean by "better." We will use probabilistic 
modeling to measure relative goodness below. 

The matching problem of the clusters is a combinatorial problem, where pos- 
sible configurations of pairs need to be evaluated, judging for each pair how 
similarly they respond to multi-way covariates. We resort to an iterative algo- 
rithm that attempts to change the matching of one cluster at a time. Choosing 
a candidate pair, we compare its goodness to an average pair (uniformly se- 
lected having one same endpoint), and accept forming a link between them by a 
Metropolis criterion that compares the likelihoods of the two pairings. A reverse 
operation is to attempt to break a link by comparing an existing link between 
two clusters to an average (random) pair. The goodness (likelihood) of a pair is 
evaluated by a shared multi-way model between the clusters. Clusters with no 
pairs are modelled as specific effects. Averaging over the iterations, we can esti- 
mate the probability for matchings and the "shapes" of the multi-way effects. A 
high probability of a specific pair indicates a found matching. A high probability 
for being modelled as a specific effect indicates the cluster has no pair. 

3 Results 

We illustrate the method with generated data and lipidomic time series data 
with a two-way, time-series experimental design. In the experiments, we neglect 
the static disease effects f3i, and assume all the disease effects are due to HMM- 
state-specific disease effects {cxl3)sb- 

3.1 Generated data 

We generated from the model two data sets X and Y with no pairing of samples 
but only a shared two-way design. There arc 11 separate time-series ("patients") 
in both of the two data matrices, each series consisting of 5 to 15 time points. This 
results in 108 and 115 samples, and data matrices are 200- and 210-dimensional. 
The latent factors x'^' and y^^* arc 3- and 4-dimensional, respectively. The data 
in each population is generated from a shared HMM-chain with 5 states. We 
generate three covariate effects into the data: (i) a shared temporal effect cXg as 
0, -1-0.5, +1, +1.5. +2 in one cluster of data set X and one cluster of data set Y, 
(ii) a shared interaction effect (a/3)s6 as 0, —0.5, —1, —1.5, —2 in another cluster 
of X and another cluster of Y, (iii) a specific temporal effect a.^ as 0, —0.5, — 1, 
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— 1.5, —2 in yet another cluster of data set Y. Patterns (i) and (ii) are the only 
behavior that is shared between the two data sets representing the two species, 
and the model should be able to learn the correct pairing of variable clusters 
based on this similarity. We use the proposed model to jointly align the samples 
into HMM states, learn the clusters of variables, search for the possible pairing 
of the clusters between the two data sets, and model the ANOVA-type effects 
acting on the found clusters. We chose a priori a model with 5 HMM states. 
For the analysis we discarded the first 5 000 samples from the Gibbs sampler as 
a burn-in period and after this ran the model for another 5 000 Gibbs samples 
of which we saved every 5th sample to obtain a total of 1 000 Gibbs samples to 
approximate the underlying posterior distribution. 

Our model finds the previously generated clusters without mistakes (see 
Fig. [2]). It also connects the X-cluster 1 to Y-cluster 3 and X-cluster 3 to Y- 
cluster 2 by estimating them to be linked in 89.9 % and 47.9 % of posterior 
samples, respectively. The generated shared and specific effects are detected as 
expected and no false positive effects are found. 



3.2 Lipidomic time-series data 

We then validate the model with a real lipidomic data set, consisting of time- 
series measurements from a recently published type 1 diabetes (TID) follow- 
up study |13| . In the data, there are 71 healthy patients and 53 patients that 
later developed into TID. For each time-series, there are 3-29 time points from 
irregular intervals. For this validation study, we randomly divide the individuals 
into two non-overlapping data sets X and Y. We then study, whether we can 
find a similar response to the covariates disease and HMM-time, and a matching 
between those clusters from the two data sets that respond to these covariates. 

Again, we a priori chose a model with 5 HMM states. We learned a two-way 
model for the data, where the first way is "HMM-time" and the second way 
is "disease". Lipids were assigned into 6 clusters. We discarded 10 000 burn- 
in samples and collected 10 000 Gibbs samples of which we saved every 10th 
sample to obtain a total of 1 000 Gibbs samples approximating the posterior 
distribution. 

The model integrates the datasets X and Y by learning the HMM-time effects 
as and interaction effects {a/3)sb- Clusters of lipids and effects found (Fig. [3| 
were similar as in our previous publication [S]. The model finds three matching 
clusters between X and Y responding similarly to the external covariates, thus 
linking the same lipids between the two subsets of data without prior knowledge. 
The corresponding clusters were paired in 10-12 % of posterior samples, which 
is higher than for other combinations of pairing (0-11 %). Naturally the method 
does not find matchings for clusters that do not respond to external covariates. 

A group of triglycerol (TAG) and two groups of glycerophosphocholine (Gp- 
Cho) were strongly paired to their counterparts. On real data, the result is 
naturally not as good as on generated data, since the effects are weaker. 
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Fig. 2. Pairing results from generated time-series data. Shown are the main effects 
(HMM-time; left) and interaction effects (right). Topmost, the generated effects are 
illustrated. Lower, the table of estimated effects shows shared (top-right area) and 
specific (left column and bottom row) effects for both types of effects. Rows and columns 
in the area of shared effects correspond to clusters in data sets X and Y, respectively. 
The found true pairing is highlighted by a red box. Value on top of each plot shows 
the percentage of posterior samples where the effect is found. An effect above or below 
zero is considered significant. 
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Fig. 3. Pairing results from lipidomic time-series data. Only the main effects ocs are 
shown. The true pairings found are highlighted by red boxes. The table shows shared 
(top-right area) and specific (left column and bottom row) effects estimated by the 
model. Rows and columns in the area of shared effects correspond to clusters in data 
sets X and Y, respectively. An effect (boxplot) consistently above or below zero is 
considered significant. Value on top of each plot shows the percentage of iterations 
where the clusters were matched. 



4 Discussion 

We presented a novel method for translating biomarkers between multiple species 
from multi-way, time-series experiments. The case we addressed is when there 
are no a priori known matching between the variables in the two datasets, but 
only a similar experimental design. The method estimates ANOVA-type multi- 
way covariate effects for clusters of variables, and identifies and separates effects 
that are shared between the species and effects that are species-specific. 

For biological data, the task is harder than for generated data. Probabilities of 
matched shared effects were lower for biological data, which is caused by the fact 
that the covariate effects in a biological experiment are weaker, making it more 
challenging for the method to find the similarity between the data sets. The study 
with lipidomic TID data showed, however, that the method is able to extract 
similarities between non-paired biological data sets. The approach presented can 
be naturally extended to multiple ways (covariates) and to multiple species. 
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